{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d2f3b2dd",
   "metadata": {},
   "source": [
    "# RT-DETRv2 Object Detector"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6840ecc2",
   "metadata": {},
   "source": [
    "<h1>Table of Contents<span class=\"tocSkip\"></span></h1>\n",
    "<div class=\"toc\">\n",
    "   <ul class=\"toc-item\">\n",
    "      <li>\n",
    "         <span><a href=\"#RT-DETRv2 Object Detector\" data-toc-modified-id=\"RT-DETRv2-Object-Detector-1\"><span class=\"toc-item-num\">1&nbsp;&nbsp;</span>RT-DETRv2 Object Detector</a></span>\n",
    "         <ul class=\"toc-item\">\n",
    "            <li><span><a href=\"#Introduction\" data-toc-modified-id=\"Introduction-1.1\"><span class=\"toc-item-num\">1.1&nbsp;&nbsp;</span>Introduction</a></span></li>\n",
    "            <li>\n",
    "               <span><a href=\"#Earlier-works\" data-toc-modified-id=\"Earlier-works-1.2\"><span class=\"toc-item-num\">1.2&nbsp;&nbsp;</span>Earlier works</a></span>\n",
    "               <ul class=\"toc-item\">\n",
    "                  <li><span><a href=\"#RT-DETR\" data-toc-modified-id=\"RT-DETR-1.2.1\"><span class=\"toc-item-num\">1.2.1&nbsp;&nbsp;</span>RT-DETR</a></span></li>\n",
    "                  <!-- <li><span><a href=\"#Fast-R-CNN\" data-toc-modified-id=\"Fast-R-CNN-1.2.2\"><span class=\"toc-item-num\">1.2.2&nbsp;&nbsp;</span>Fast R-CNN</a></span></li> -->\n",
    "               </ul>\n",
    "            </li>\n",
    "            <li>\n",
    "               <span><a href=\"#RT-DETRv2\" data-toc-modified-id=\"RT-DETRv2-1.3\"><span class=\"toc-item-num\">1.3&nbsp;&nbsp;</span>RT-DETRv2</a></span>\n",
    "               <!-- <ul class=\"toc-item\">\n",
    "                  <li><span><a href=\"#Region-Proposal-Network-(RPN)\" data-toc-modified-id=\"Region-Proposal-Network-(RPN)-1.3.1\"><span class=\"toc-item-num\">1.3.1&nbsp;&nbsp;</span>Region Proposal Network (RPN)</a></span></li>\n",
    "               </ul> -->\n",
    "            </li>\n",
    "            <li><span><a href=\"#Implementation-in-arcgis.learn\" data-toc-modified-id=\"Implementation-in-arcgis.learn-1.4\"><span class=\"toc-item-num\">1.4&nbsp;&nbsp;</span>Implementation in <code>arcgis.learn</code></a></span></li>\n",
    "            <li><span><a href=\"#References\" data-toc-modified-id=\"References-1.5\"><span class=\"toc-item-num\">1.5&nbsp;&nbsp;</span>References</a></span></li>\n",
    "         </ul>\n",
    "      </li>\n",
    "   </ul>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "919705ef",
   "metadata": {},
   "source": [
    "## Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d8133140",
   "metadata": {},
   "source": [
    "We have seen how the object detection models such as [SSD](https://developers.arcgis.com/python/guide/how-ssd-works/), [RetinaNet](https://developers.arcgis.com/python/guide/how-retinanet-works/), and [YOLOv3](https://developers.arcgis.com/python/latest/guide/yolov3-object-detector/) work. However, these models primarily are Convolution Neaural Network (CNN) based architectures. Eversince the development of the revolutionary [Transformer](https://arxiv.org/pdf/1706.03762) architecture in the language domain, researchers had been trying to incorporate the core idea of the transformers - the self attention, to vision domain. \n",
    "\n",
    "Subsequently, this inspired the development of a CNN-transformer hybrid Object Detection model called the [Detection Transformer (DETR)](https://arxiv.org/pdf/2005.12872). The DETR extracted it's features through a Convolutional backbone and applied a transformer on top of those features to directly predict a set of detections. The DETR model however, did not beat the popular YOLO family models in terms of both accuracy and speed, but it was a beginning in the direction of fully transformer based Object detection models. The idea of self attention was attractive because it gave global context to the model from the very first layer, however being an O(n<sup>2</sup>) complexity algoithm meant high compute time which, unlike the YOLO family, rendered DETR (much like other Vision Transformer based architectures) slow and unable to process images at real-time. \n",
    "\n",
    "With the recent development of the [Real time DETR (RT-DETR)](https://arxiv.org/pdf/2304.08069) and its successor, [RT-DETRv2](https://arxiv.org/pdf/2407.17140), DETRs have finally outperformed YOLO family in both speed and accuracy. In this guide, we will learn more about RT-DETRv2 and how we can use it for your own tasks using `arcgis.learn`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b18ef657",
   "metadata": {},
   "source": [
    "<center><img src=\"../../static/img/Object_detection_through_RT_DETRv2.png\" /></center>\n",
    "<center>Figure 1. Object Detection using RT-DETRv2 </center>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0cec132a",
   "metadata": {},
   "source": [
    "## Earlier works"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "52353a58",
   "metadata": {},
   "source": [
    "### RT-DETR"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "87b3984c",
   "metadata": {},
   "source": [
    "The YOLO series had been the most popular framework for real-time object detection due to its reasonable trade-off between speed and accuracy. However,the speed and accuracy of YOLOs are negatively affected by the NMS. Recently, end-to-end Transformer-based detectors (DETRs) have provided an alternative to eliminating NMS. Nevertheless, the high computational cost limits their practicality and hinders them from fully exploiting the advantage of excluding NMS. The Real-Time DEtection TRansformer (RT-DETR) became the first end-to-end real-time object detector to addresses this issue by removing the need for NMS and making the architecture computationally efficient, outperforms YOLO series in both speed and accuracy. It's architecture (Figure 2) consists of three main components: a hybrid encoder, a query selection mechanism, and a Transformer decoder.\n",
    "\n",
    "RT-DETR employs a **hybrid encoder** to extract rich, multi-scale features while maintaining real-time efficiency. Unlike traditional Transformer-based architectures that process features across scales simultaneously (leading to high computation costs), RT-DETR decouples intra-scale interactions and cross-scale fusion (Figure 2). Specifically, The features from the last three stages of a CNN backbone are fed into the encoder. The efficient hybrid encoder transforms multi-scale features into a sequence of image features through the Attention-based Intra-scale Feature Interaction (AIFI) and the CNN-based Cross-scale Feature Fusion (CCFF). The intra-scale interaction step refines features at each level independently, while the cross-scale fusion step aggregates information across different scales efficiently. This design allows for high-quality feature representation with reduced computational overhead.\n",
    "\n",
    "A critical innovation in RT-DETR is its **uncertainty-minimal query selection** mechanism. Instead of generating random object queries, RT-DETR selects high-quality initial queries based on an uncertainty metric, ensuring that the most relevant object representations are passed to the decoder. This step improves detection accuracy and convergence speed.\n",
    "\n",
    "The **Transformer decoder** in RT-DETR directly predicts object bounding boxes without relying on post-processing techniques like NMS. By leveraging set-based predictions, each query predicts an object in a one-to-one fashion, mitigating duplicate detections. The decoder layers refine these predictions iteratively, and RT-DETR allows adjusting the number of decoder layers to balance speed and accuracy dynamically.\n",
    "\n",
    "The RT-DETR-R50 (resnet50 backbone) achieves **53.1% AP on COCO** and **108 FPS on T4 GPU**, outperforming previously advanced YOLOs in both speed and accuracy. Furthermore, pretraining on Objects365 boosts its performance to 55.3% AP with the same backbone, making it a strong alternative for real-time object detection tasks."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "601a868f",
   "metadata": {},
   "source": [
    "<div style=\"text-align: center;\">\n",
    "  <img src=\"../../static/img/rtdetr_architecture.png\" style=\"width:70%;\" />\n",
    "  <div style=\"max-width: 1000px; text-align: justify; margin: auto;\">\n",
    "    <strong>Figure 2.</strong> Overview of RT-DETR Architecture. The features from the last three stages of a CNN backbone are fed into the encoder. \n",
    "    The efficient hybrid encoder transforms multi-scale features into a sequence of image features through the Attention-based Intra-scale Feature \n",
    "    Interaction (AIFI) and the CNN-based Cross-scale Feature Fusion (CCFF). Then, the uncertainty-minimal query selection selects a fixed number \n",
    "    of encoder features to serve as initial object queries for the decoder. Finally, the decoder with auxiliary prediction heads iteratively optimizes \n",
    "    object queries to generate categories and boxes.\n",
    "  </div>\n",
    "</div>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2bacc150",
   "metadata": {},
   "source": [
    "## RT-DETRv2\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3d2671d",
   "metadata": {},
   "source": [
    "RT-DETRv2 builds upon the previous state-of-the-art real-time detector, RT-DETR, and opens up\n",
    "a set of bag-of-freebies for flexibility and practicality, as well as optimizing the training strategy to\n",
    "achieve enhanced performance. **The architecture of RT-DETRv2 remains the same as RT-DETR, with only modifications to the deformable attention module of the decoder**.\n",
    "\n",
    "\n",
    "RT-DETRv2 assigns a distinct number of sampling points for features at different scales within the deformable attention module of the decoder. This approach allows for more effective multi-scale feature extraction by acknowledging the inherent differences across feature scales.\n",
    "\n",
    "To enhance the practicality of the model, RT-DETRv2 introduces an optional discrete sampling operator to replace the original grid_sample operator. This substitution addresses deployment constraints typically associated with detection transformers, facilitating broader application and integration.\n",
    "\n",
    "Employing strategies like dynamic data augmentation and scale-adaptive hyperparameter customization, improve the model's performance without compromising inference speed, offering a balanced approach to training efficiency and accuracy. RT-DETRv2 outperforms RT-DETR at different scales of detectors without loss of speed."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f3fce3e3",
   "metadata": {},
   "source": [
    "## Implementation in `arcgis.learn`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2e04a5a6",
   "metadata": {},
   "source": [
    "You can create a RT-DETRv2 model in `arcgis.learn` using a single line of code.  \n",
    "```\n",
    "model = RTDetrV2(data)\n",
    "```\n",
    "Where ``data`` is the databunch that you would have prepared using `prepare_data` function.\n",
    "Optionally, a ``backbone`` model from the ResNet family can be provided. The default is set to `resnet50`.\n",
    "\n",
    "For more information about the API, please go to the [API reference](https://developers.arcgis.com/python/latest/api-reference/arcgis.learn.toc.html#rtdetrv2)."
   ]
  },
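  {
   "cell_type": "markdown",
   "id": "7c1e9a40",
   "metadata": {},
   "source": [
    "For context, a typical training workflow around that line might look like the sketch below. It is an outline under our own assumptions rather than a prescribed recipe: the helper `train_rtdetrv2`, the batch size, the number of epochs, and the saved model name are hypothetical, and the data path must point to training data exported for object detection.\n",
    "\n",
    "```python\n",
    "def train_rtdetrv2(data_path, epochs=10, batch_size=8, backbone='resnet50'):\n",
    "    # Imported here so that defining this helper does not require arcgis to be installed.\n",
    "    from arcgis.learn import RTDetrV2, prepare_data\n",
    "\n",
    "    # Build the databunch from exported training data.\n",
    "    data = prepare_data(data_path, batch_size=batch_size)\n",
    "\n",
    "    # Create the model; the backbone is optional and defaults to resnet50.\n",
    "    model = RTDetrV2(data, backbone=backbone)\n",
    "\n",
    "    # Let the library suggest a learning rate, then train.\n",
    "    lr = model.lr_find()\n",
    "    model.fit(epochs, lr=lr)\n",
    "\n",
    "    # Persist the trained model under a hypothetical name.\n",
    "    model.save('rtdetrv2_model')\n",
    "    return model\n",
    "\n",
    "\n",
    "# Example call (placeholder path):\n",
    "# model = train_rtdetrv2(r'path/to/exported/training/data')\n",
    "```\n",
    "\n",
    "Check the API reference for the arguments supported by your installed version before relying on these defaults."
   ]
  },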
  {
   "cell_type": "markdown",
   "id": "e758b514",
   "metadata": {},
   "source": [
    "## References"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29fb338b",
   "metadata": {},
   "source": [
    "[1] Wenyu Lv, Yian Zhao, Qinyao Chang, Kui Huang, Guanzhong Wang, Yi Liu: “RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer”, 2024; [https://arxiv.org/abs/2407.17140 arXiv:2407.17140].\n",
    "\n",
    "[2] Yian Zhao, Wenyu Lv, Shangliang Xu, Jinman Wei, Guanzhong Wang, Qingqing Dang, Yi Liu, Jie Chen: “DETRs Beat YOLOs on Real-time Object Detection”, 2024; [https://arxiv.org/abs/2304.08069 arXiv:2304.08069].\n",
    "\n",
    "[3] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko: “End-to-End Object Detection with Transformers”, 2020; [https://arxiv.org/abs/2005.12872 arXiv:2005.12872].\n",
    "\n",
    "[4] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby: “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”, 2020;[https://arxiv.org/abs/2010.11929 arXiv:2010.11929].\n",
    "\n",
    "[5] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin: “Attention Is All You Need”, 2017; [https://arxiv.org/abs/1706.03762 arXiv:1706.03762]."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
